Data Wrangling

Lennart Kasserra

CorrelAid

2024-04-22

Setup

library(dplyr)
library(tidyr)

bigfoot <- readr::read_csv(here::here("data/bigfoot.csv"))

# Or if that's not working:
bigfoot <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv")

Motivation

So far we have taken the data set at face value; but if we explore we may find some issues. For example: what is “Unknown” for the season?

bigfoot |> count(season)
## # A tibble: 5 × 2
##   season      n
##   <chr>   <int>
## 1 Fall     1492
## 2 Spring    825
## 3 Summer   1867
## 4 Unknown    92
## 5 Winter    745

Nonsensical records & missing values

The easiest way of dealing with this would be to convert season to NA if it is “Unknown”:

bigfoot |> 
  mutate(season = season |> na_if("Unknown")) |> 
  count(season)
## # A tibble: 5 × 2
##   season     n
##   <chr>  <int>
## 1 Fall    1492
## 2 Spring   825
## 3 Summer  1867
## 4 Winter   745
## 5 <NA>      92

But it seems for some of the unknown we actually have a date:

bigfoot |> 
  filter(season == "Unknown", !is.na(date)) |> 
  select(date, season)
## # A tibble: 64 × 2
##    date       season 
##    <date>     <chr>  
##  1 1971-10-01 Unknown
##  2 1979-07-18 Unknown
##  3 1980-06-30 Unknown
##  4 1957-09-18 Unknown
##  5 1992-01-15 Unknown
##  6 1997-05-01 Unknown
##  7 1956-01-01 Unknown
##  8 1962-06-06 Unknown
##  9 1977-01-01 Unknown
## 10 1986-06-01 Unknown
## # ℹ 54 more rows

Nonsensical records & missing values: imputation

Let’s break the problem down (one way):

  • Imputing season from date:
    1. Extract month from date
    2. Set season to e.g. fall if month is october or november, june to september would be summer, etc. etc.

Nonsensical records & missing values: imputation

Back to our problem:

bigfoot |> 
  mutate(
    # First add the month:
    month = lubridate::month(date),
    # Then impute:
    season = case_when(
      season != "Unknown" ~ season, # if season not unknown just keep it
      month %in% c(3:5) ~ "Spring", # otherwise step through these...
      month %in% c(6:9) ~ "Summer",
      month %in% c(10:11) ~ "Fall",
      month %in% c(12, 1, 2) ~ "Winter",
      .default = NA                 # ...and to this if no condition matches.
    )
  )

case_when()

Handling missing values

Drop all missing values:

bigfoot |> drop_na()

Drop obs. where a certain variable is missing:

bigfoot |> drop_na(season)

# Or use `filter()`
bigfoot |> filter(!is.na(season))

Drop obs. where a of a number of variables is missing:

bigfoot |> drop_na(season, starts_with("temperature"))

# Or:
bigfoot |> filter(!if_any(c(season, starts_with("temperature")), is.na))
# ^ i.e. don't keep row `if_any()` of these columns are missing (`NA`)!

Real world data is… messy

fertility <- readr::read_csv(here::here("data", "wbi_fertility.csv"))
fertility
## # A tibble: 266 × 65
##    CountryName      CountryCode `1960` `1961` `1962` `1963` `1964` `1965` `1966`
##    <chr>            <chr>       <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
##  1 Afghanistan      AFG         138.8… 138.7… 138.4… 138.1… 140.1… 140.9… 142.8…
##  2 Albania          ALB         59.371 56.719 47.62  40.433 39.268 41.521 42.389
##  3 Algeria          DZA         123.1… 124.1… 123.9… 123.6… 123.2… 121.6… 119.3…
##  4 American Samoa   ASM         53.75  56.67  58.531 59.651 56.865 55.951 56.091
##  5 Andorra          AND         14.624 15.517 16.559 17.461 18.334 18.214 18.232
##  6 Angola           AGO         148.2… 150.2… 150.8… 153.0… 155.6… 156.3… 157.45
##  7 Antigua and Bar… ATG         130.75 129.6… 129.6… 130.6… 130.3… 128.5… 127.2…
##  8 Argentina        ARG         54.977 54.219 55.718 55.292 56.305 57.4   58.684
##  9 Armenia          ARM         50.24  44.673 35.681 33.444 38.036 41.236 43.512
## 10 Aruba            ABW         62.182 61.589 61.17  59.658 58.324 56.291 54.201
## # ℹ 256 more rows
## # ℹ 56 more variables: `1967` <chr>, `1968` <chr>, `1969` <chr>, `1970` <chr>,
## #   `1971` <chr>, `1972` <chr>, `1973` <chr>, `1974` <chr>, `1975` <chr>,
## #   `1976` <chr>, `1977` <chr>, `1978` <chr>, `1979` <chr>, `1980` <chr>,
## #   `1981` <chr>, `1982` <chr>, `1983` <chr>, `1984` <chr>, `1985` <chr>,
## #   `1986` <chr>, `1987` <chr>, `1988` <chr>, `1989` <chr>, `1990` <chr>,
## #   `1991` <chr>, `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, …

Reshaping / pivotting

  • The problem is that this data is wide (compared to the long format we like to work with).

Use pivot_longer() to fix this:

fertility <- 
  fertility |> 
  pivot_longer(-starts_with("Country"), names_to = "year", values_to = "fertility")

fertility
## # A tibble: 16,758 × 4
##    CountryName CountryCode year  fertility
##    <chr>       <chr>       <chr> <chr>    
##  1 Afghanistan AFG         1960  138.876  
##  2 Afghanistan AFG         1961  138.717  
##  3 Afghanistan AFG         1962  138.494  
##  4 Afghanistan AFG         1963  138.173  
##  5 Afghanistan AFG         1964  140.107  
##  6 Afghanistan AFG         1965  140.935  
##  7 Afghanistan AFG         1966  142.821  
##  8 Afghanistan AFG         1967  143.154  
##  9 Afghanistan AFG         1968  142.341  
## 10 Afghanistan AFG         1969  141.28   
## # ℹ 16,748 more rows
  • Take all columns that don’t start with “Country”, put their names into column called year, and the values into a column called fertility.

Exercise time

  • Figure out how to use pivot_wider() to reverse this (Tip: you can always access the documentation for a function, with ?, e.g. here ?pivot_wider).

Joining

We now have our data on fertility set, but we want to join it with data on female education to analyze the relationship between the two.

  • How do we “join” (or “merge”) two different dataframes?

Joining

  1. Get both of them into R & into the right format

Joining

educ <- readr::read_csv(here::here("data/wbi_education.csv"))
educ <- 
  educ |> 
  pivot_longer(-starts_with("Country"), names_to = "year", values_to = "out_of_school") |> 
  select(-CountryName) # also in the other one

Joining

  1. Get both of them into R & into the right format
  2. Decide on which type of join to do

Joining: types of joins

Joining

  1. Get both of them into R & wrangle them into the right format
  2. Decide on which type of join to do
  3. Join!

Joining

Here, we decide on an full_join(), initially keeping all observations:

full <- 
  fertility |> 
  full_join(educ, by = c("CountryCode", "year")) |> # <-
  mutate(across(c(fertility, out_of_school), as.numeric)) |> # make sure they have the right type
  drop_na(fertility, out_of_school)

full
## # A tibble: 2,404 × 5
##    CountryName CountryCode year  fertility out_of_school
##    <chr>       <chr>       <chr>     <dbl>         <dbl>
##  1 Afghanistan AFG         1974     143.           94.6 
##  2 Albania     ALB         2000      14.8           7.58
##  3 Albania     ALB         2001       9.70          4.59
##  4 Albania     ALB         2017      16.6           2.84
##  5 Albania     ALB         2020      14.7           5.30
##  6 Albania     ALB         2021      14.5           6.10
##  7 Algeria     DZA         1984      45.9          49.6 
##  8 Algeria     DZA         1995      19.1          32.7 
##  9 Algeria     DZA         1996      16.7          32.5 
## 10 Algeria     DZA         1997      13.3          27.8 
## # ℹ 2,394 more rows

Modelling & presenting results

A key part of your thesis or term papers will probably be to fit a model & present the results.

Modelling & presenting results

Functionality for many simple model types is built into R (like lm() to fit a linear model)

model <- lm(fertility ~ out_of_school, data = full)

Modelling & presenting results

Now we are left with a model-object; we can inspect the results in R using summary():

summary(model)
## 
## Call:
## lm(formula = fertility ~ out_of_school, data = full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.768  -20.834   -5.979   18.438  107.145 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.77240    0.84729   30.42   <2e-16 ***
## out_of_school  1.53311    0.03043   50.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.74 on 2402 degrees of freedom
## Multiple R-squared:  0.5138, Adjusted R-squared:  0.5136 
## F-statistic:  2538 on 1 and 2402 DF,  p-value: < 2.2e-16

Modelling & presenting results

To present results, look into stargazer or modelsummary

stargazer::stargazer(model, type = "html", header = FALSE)
Dependent variable:
fertility
out_of_school 1.533***
(0.030)
Constant 25.772***
(0.847)
Observations 2,404
R2 0.514
Adjusted R2 0.514
Residual Std. Error 31.739 (df = 2402)
F Statistic 2,538.193*** (df = 1; 2402)
Note: p<0.1; p<0.05; p<0.01

Modelling & presenting results

For most other model types, R has some kind of package. Here is for example a one-way fixed-effects model using fixest:

model <- fixest::feols(
  fertility ~ out_of_school,
  data = full,
  fixef = "CountryCode",
  vcov = ~CountryCode
)

summary(model)
## OLS estimation, Dep. Var.: fertility
## Observations: 2,404
## Fixed-effects: CountryCode: 189
## Standard-errors: Clustered (CountryCode) 
##               Estimate Std. Error t value   Pr(>|t|)    
## out_of_school 0.893338   0.143206 6.23813 2.8614e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 11.3     Adj. R2: 0.933043
##              Within R2: 0.34102

Modelling & presenting results

An example of modelsummary:

model |> 
  list("Fertility" = _) |> 
  modelsummary::modelsummary(
    stars = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
    coef_map = c("out_of_school" = "Girls out of school")
  )
tinytable_hkofdyo899xrihvg2qau
Fertility
* p < 0.1, ** p < 0.05, *** p < 0.01
Girls out of school 0.893***
(0.143)
Num.Obs. 2404
R2 0.938
R2 Adj. 0.933
R2 Within 0.341
R2 Within Adj. 0.341
AIC 18861.0
BIC 19960.1
RMSE 11.30
Std.Errors by: CountryCode
FE: CountryCode X

Modelling & presenting results

Back to our original research question: what predicts reliable Bigfoot sightings?

  1. Cleaning:
bigfoot <- 
  bigfoot |> 
  mutate(
    # Temperature to celsius:
    temp_celsius = (temperature_mid - 32) / 1.8,
    # Construct outcome:
    reliable_sighting = if_else(classification == "Class A", 1, 0)
  )

Modelling & presenting results

  1. Exploring:
bigfoot |> 
  count(reliable_sighting)
## # A tibble: 2 × 2
##   reliable_sighting     n
##               <dbl> <int>
## 1                 0  2540
## 2                 1  2481

Modelling & presenting results

  1. Modelling
bigfoot_model <- lm(
  reliable_sighting ~ temp_celsius + humidity + moon_phase + cloud_cover + uv_index,
  data = bigfoot
)

summary(bigfoot_model)
## 
## Call:
## lm(formula = reliable_sighting ~ temp_celsius + humidity + moon_phase + 
##     cloud_cover + uv_index, data = bigfoot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5946 -0.4875 -0.4161  0.5074  0.6301 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.551415   0.061511   8.965  < 2e-16 ***
## temp_celsius  0.003613   0.001399   2.583  0.00985 ** 
## humidity     -0.156548   0.078578  -1.992  0.04644 *  
## moon_phase    0.012315   0.032245   0.382  0.70254    
## cloud_cover   0.097971   0.037737   2.596  0.00948 ** 
## uv_index     -0.009106   0.005072  -1.795  0.07269 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4991 on 2882 degrees of freedom
##   (2133 Beobachtungen als fehlend gelöscht)
## Multiple R-squared:  0.005269,   Adjusted R-squared:  0.003543 
## F-statistic: 3.053 on 5 and 2882 DF,  p-value: 0.009409

Presenting

bigfoot_model |> 
  list("Reliable Sighting" = _) |> 
  modelsummary::modelsummary(
    stars = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
    coef_map = c(
      "temp_celsius" = "Temperature (C)",
      "humidity" = "Humidity",
      "moon_phase" = "Moon Phase",
      "cloud_cover" = "Cloud Cover",
      "uv_index" = "UV Index",
      "Intercept" = "Constant"
    )
  )
tinytable_f7zhdreg9qd9ckjczclm
Reliable Sighting
* p < 0.1, ** p < 0.05, *** p < 0.01
Temperature (C) 0.004***
(0.001)
Humidity -0.157**
(0.079)
Moon Phase 0.012
(0.032)
Cloud Cover 0.098***
(0.038)
UV Index -0.009*
(0.005)
Num.Obs. 2888
R2 0.005
R2 Adj. 0.004
AIC 4189.6
BIC 4231.4
Log.Lik. -2087.791
F 3.053
RMSE 0.50

Final exercise

There is data on UFO-sightings attached:

ufo <- readr::read_csv(here::here("data/ufo_sightings.csv"))
ufo
## # A tibble: 96,429 × 12
##    reported_date_time  reported_date_time_utc posted_date city        state     
##    <dttm>              <dttm>                 <date>      <chr>       <chr>     
##  1 2022-08-29 06:03:00 2022-08-29 06:03:00    2022-09-09  Pinehurst   NC        
##  2 2022-08-20 01:51:00 2022-08-20 01:51:00    2022-10-08  Rapid City  MI        
##  3 2022-08-13 05:30:00 2022-08-13 05:30:00    2022-09-09  Cleveland   OH        
##  4 2022-08-06 21:00:00 2022-08-06 21:00:00    2022-09-09  Bloomington IN        
##  5 2022-08-04 07:40:00 2022-08-04 07:40:00    2022-09-09  Irvine      CA        
##  6 2022-07-22 16:00:00 2022-07-22 16:00:00    2022-09-09  Moore       OK        
##  7 2022-07-19 16:27:00 2022-07-19 16:27:00    2022-09-09  Short Pump  VA        
##  8 2022-07-14 18:56:00 2022-07-14 18:56:00    2022-09-09  Norwalk     CT        
##  9 2022-07-13 19:40:00 2022-07-13 19:40:00    2022-09-09  Blayney     New South…
## 10 2022-07-13 04:10:00 2022-07-13 04:10:00    2022-09-09  Greybull    WY        
## # ℹ 96,419 more rows
## # ℹ 7 more variables: country_code <chr>, shape <chr>, reported_duration <chr>,
## #   duration_seconds <dbl>, summary <chr>, has_images <lgl>, day_part <chr>
  • Explore the data
    • What is the most common sighting (shape)?
    • What country has the most sightings? Which state has the most?
    • What year had the most sightings (hint: use lubridate::year() to extract the year from a date)
    • What were the top 10 countries/years?